Scalable Distributed Consensus to Support MPI Fault Tolerance

Author

  • Darius Buntinas

Similar resources

Scalable Fault Tolerant MPI: Extending the Recovery Algorithm

Fault Tolerant MPI (FT-MPI) [6] was designed to give applications different methods for handling process failures beyond simple checkpoint/restart schemes. The initial implementation of FT-MPI included a robust, heavyweight system state recovery algorithm that was designed to manage the membership of MPI communicators during multiple failures. The algorithm and its implementation alt...

MPIgnite: An MPI-Like Language and Prototype Implementation for Apache Spark

Scale-out parallel processing based on MPI is a 25-year-old standard with at least another decade of preceding history of enabling technologies in the High Performance Computing community. Newer frameworks such as MapReduce, Hadoop, and Spark represent industrial scalable computing solutions that have received broad adoption because of their comparative simplicity of use, applicability to relev...

The Impact of a Fault Tolerant MPI on Scalable Systems Services and Applications

Exascale targeted scientific applications must be prepared for a highly concurrent computing environment where failure will be a regular event during execution. Natural and algorithm-based fault tolerance (ABFT) techniques can often manage failures more efficiently than traditional checkpoint/restart techniques alone. Central to many petascale applications is an MPI standard that lacks support ...

High Performance Broadcast Support in LA-MPI over Quadrics

LA-MPI is a unique MPI implementation that provides network-level fault-tolerant message passing. This paper describes the efficient implementation of a scalable MPI broadcast algorithm. LA-MPI implements a generic version of the broadcast algorithm using a spanning tree method built on top of point-to-point messaging. However, the Quadrics network, with its hardware broadcast support, provide...

Practical Scalable Consensus for Pseudo-Synchronous Distributed Systems: Formal Proof

The ability to consistently handle faults in a distributed environment requires, among a small set of basic routines, an agreement algorithm allowing surviving entities to reach a consensual decision between a bounded set of volatile resources. This paper presents an algorithm that implements an Early Returning Agreement (ERA) in pseudo-synchronous systems, which optimistically allows a process...

Publication date: 2011